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abstract 

Astrophysical research in recent decades has made significant progress thanks to the availability of various 
A-body simulation techniques. With the rapid development of high-performance computing technologies, 
modern simulations have been able to take the computing power of massively parallel clusters with more 
than 10^ GPU cores. While unprecedented accuracy and dynamical scales have been achieved, the enormous 
amount of data being generated continuously poses great challenges for the subsequent procedures of data 
analysis and archiving. As an urgent response to these challenges, in this paper we propose an adaptive storage 
scheme for simulation data, inspired by the block time step integration scheme found in a number of direct 
A-body integrators available nowadays. The proposed scheme, namely the block time step storage scheme, 
works by minimizing the data redundancy with assignments of data with individual output frequencies as 
required by the researcher. As demonstrated by benchmarks, the proposed scheme is applicable to a wide 
variety of simulations. Despite the main focus of developing a solution for direct A-body simulation data, the 
methodology is transferable for grid-based or tree-based simulations where hierarchical time stepping is used. 

Subject headings: methods; data analysis — methods; numerical — globular clusters; general — planets and 
satellites; dynamical evolution and stability — virtual observatory tools 


1. INTRODUCTION 

The gravitational A-body problem has posed a challenge 
ever since it was mathematically formulated in the 17* cen¬ 
tury by Isaac Newton. This problem of determining from 
initial conditions the future motion of A-bodies interacting 
gravitationally amongst themselves continues to be relevant 
in modern day astronomy and is investigated in the context 
of planetary systems, star clusters, galaxies and the Universe. 
Mathematically, it is posed as 3A coupled nonlinear second- 
order ordinary differential equations. The solution consists 
of the phase-space paths of all particles as functions of time, 
which generally cannot be expressed by algebraic expressions 
or integrals. 

Gravitational A-body simulations are currently the pre¬ 
ferred approach for finding these solutions. They use particles 
to represent gravitating objects and propagate the initial con¬ 
ditions in time by calculating the force acting on each particle, 
and advancing it in time in small steps. 

With advances in technology, simulations have become 
elaborate enough to take full advantage of computing capabil¬ 
ities. Especially the availability of highly-parallelized com¬ 
puting facilities, some using hardware accelerators (such as 
GRAPE (Makino & Taiji 1998)', FPGA^ boards (Berczik et 
al. 2007), and more recently GPUs^) have contributed to re¬ 
cent progress in the field. Simulations are carried out with 
more particles than ever before, and for longer integration 
times. Modem simulations are also characterized by the in- 
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elusion of more detailed physical processes and requirements 
for higher numerical accuracy. 

While modern powerful hardware has brought astrophysi¬ 
cal simulations to unprecedented accuracy, complexities come 
with the problem of storing the results, which is done by writ¬ 
ing to the hard disk some or all properties of some or all par¬ 
ticles at some pre-specified times; this is often called “tak¬ 
ing a snapshot”. The snapshot files can later be processed 
to learn about the evolution of the system. The storage re¬ 
quirement is determined by four factors; (1) the number of 
particles, (2) the size of the data record per particle, (3) the 
output frequency and (4) total integration time. In order to 
capture the detailed physical processes, high time-resolution 
of output often necessary. There is thus a tradeoff between 
time-resolution and output size (or the availability of storage 
space and post-processing capabilities). 

To illustrate this, consider some large cosmological simula¬ 
tions from the previous decade. The Millennium Simulation 
(Springel et al. 2005) followed about lO*" particles for nearly 
a Hubble time, and produced only 64 snapshots of about 300 
GB each. Similarly, the MareNostrum simulation (Gottloeber 
et al. 2006) used 2x10® particles and saved 135 snapshots of 
64 GB each (running for a similar physical time). Even almost 
a decade later, this data volume still poses a challenge for stor¬ 
age, and more crucially for transport over a network and for 
analysis. Thus, there is a gap between computing power and 
data processing and management capabilities. 

Direct summation techniques are often preferred when 
studying a system in which accurate orbital integration is 
needed and encounters are important, and/or physical assump¬ 
tions have to be minimized, such as globular clusters and plan¬ 
etary systems. In cosmological simulations, where the pri- 


2 


Cai et al. 


mary interest is to study the evolution of the large scale struc¬ 
ture, one often uses Tree methods (Barnes & Hut 1986) or Fast 
Multipole Method (Greengard & Rokhlin 1987), which are 
generally much faster (for large N) but introduce an approxi¬ 
mation to the force contributions from very distant particles. 
Direct A^-body simulations thus generally use a much smaller 
number of particles, nowadays rarely more than N = 10®. 

Big data management has so far been in the domain of cos¬ 
mological collisionless simulations due to the large number 
of particles. Despite the relatively small number of particles 
in direct A^-body simulations, data output can be a challenge 
for this kind of simulations as well. The key challenge is 
that they attempt to follow accurately phenomena which hap¬ 
pen on vastly different timescales: from white dwarf binaries 
which have orbital periods of less than one hour (Brown et al. 
2011) to stars orbiting in the outskirts of the cluster, which 
can take millions of years to complete one orbit. Calculat¬ 
ing the evolution of the entire cluster based on the smallest 
time step or timescale is completely impractical (see an es¬ 
timation in Section 3), so most productive codes employ in¬ 
dividual or hierarchical time step schemes like the Hermite 
scheme (Aarseth 2003). Saving the output, however, is usu¬ 
ally done using snapshots in much the same way as for the 
large cosmological simulations. 

The main problem with the snapshot approach is that no 
information is stored about what happens between snapshots. 
Interpolating will not always yield useful information if the 
process of interest occurs on a much shorter timescale than the 
snapshot interval. Examples of this are close encounters that 
may create hypervelocity stars (e.g. Yu & Tremaine 2003), 
resonances such as Kozai oscillations (e.g. Katz et al. 2011), 
evolution of planetary systems (e.g. Hao et al. 2013) or super¬ 
nova explosions. Those phenomena may be captured by the 
program and recorded separately, but there is no standardized 
way of doing so. For the same reason, it is difficult to make 
a smooth visualization of an energetically active subsystem. 
On the other hand, there might be redundant data for the dy¬ 
namically inactive particles. For this reason, Farr et al. (2012) 
proposed an adaptive approach of data output such that only 
recently changed data during the last output interval will be 
written to hies. They also proposed the Particle Stream Data 
Format (PSDF), a YAMF (Yet another Markup Fanguage)"^ 
based structured text format to ensure machine-independence 
and hexibility for most simulation data. 

Traditionally, snapshot hies are simple ascii hies contain¬ 
ing a table where rows represent the particles and the columns 
represent their properties; a header may have some additional 
information such as the snapshot time. This format is used, 
for example, by the phiGRAPE code (Harfst et al. 2007). 
While this scheme has some advantages being easy to process, 
human-readable, and machine-independent, it is not native to 
the machine representation of data and usually requires aux¬ 
iliary information to build up the structure, thus resulting in 
much less efficient storage and longer parsing time compared 
to binary formats. In contrast, binary formats store the same 
information in a more compact way using some common rep¬ 
resentation of numerical data (such as the IEEE754 hoating- 
point specihcation), and are preferred when large volumes of 
data are expected (e.g. the 0UT3 hie of NB0DY6, see Aarseth 
1999). There are, however, many binary formats for particle 
data, differing in how the data are arranged in the hie and how 
it is described by the metadata (see Section 5.1). Different 
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binary formats generally produce hies of similar sizes (espe¬ 
cially when the data volume is large), and with little statistical 
redundancy, and therefore cannot be further reduced in size 
by data compression algorithms. 

Big data must be written efficiently to the storage medium 
without interrupting or signihcantly slowing down the simu¬ 
lation process itself, thus often dedicated nodes or processors 
are used just to write the data to disk, while the others con¬ 
tinue the integration (asynchronous output). Some high per¬ 
formance I/O libraries, such as MPI-IO, allow multiple nodes 
or processors to write to the same hie in parallel. Beside the 
writing, some ways to deal with big data in this context are 
utilized as needed. If the data are sorted in a certain way this 
could make a snapshot hie smaller by not saving the particle 
ID. Alternatively, if the output of a tree code makes use of 
the space-hlling (Hilbert) curve, it is easier to rapidly access 
spatial sub-volumes of the data (Springel et al. 2005). An¬ 
other way is to more frequently output a subset of particles 
of interest, such as black holes (Berczik et al. 2005, 2006). 
Most importantly, to reduce the amount of data that needs to 
be saved at least part of the analysis is carried out “on the 
hy”. Regardless of the efforts of designing highly efficient 
data structures, some data processing on the Hy may in fact be 
necessary. An example for such analysis is the calculation of 
Fagrange radii in NB0DY6 (Aarseth 1999) and saving of image 
hies of the system in PKDGRAV (Jetley et al. 2008). 

In this paper we propose a scalable storage scheme for 
A-body simulation data using the HDF5 high-performance 
data format, because of its hierarchical nature that allows us 
to store time-evolving hierarchical systems such as globular 
clusters. This paper is organized as follows: Section 2 de¬ 
scribes the mode of operation of a direct A-body code; an 
adaptive storage scheme inspired by FaiT et al. (2012) for di¬ 
rect A-body simulation data and an analysis of data rate is pre¬ 
sented in Section 3; other possible approaches for data scaling 
are presented are presented in Section 4; technical concerns 
and benchmarks of the proposed scheme are presented in Sec¬ 
tion 5. Finally, applications of the proposed storage scheme 
are presented in Section 6. 


2. DIRECT W-BODY SIMULATIONS 

In the direct A-body scheme, the equation of motion for a 
particle of index i in a system containing A particles takes the 
form (Aarseth 2003): 


r, = 


V iw/r,- - r,) 

it |r/-r;P 


( 1 ) 


where mj are the masses of the other particles, A is the total 
number of particles, r are the positions, and G is the gravity 
constant. Full calculation of the mutual gravitational forces 
for a system of A particles corresponds to ~ A^ terms. The 
positions and velocities are updated subsequently by assum¬ 
ing that the evaluated force exerted on the particle is constant 
or can be interpolated with a polynomial during a certain time 
step At (see more information about the integrator below). 
When the time step At is shared among all particles, the to¬ 
tal number of calculations for TtotaU A-body time step is 

5 = -A(A-1) —. (2) 

2 At 

The choice of At varies among different integration algo¬ 
rithms. Employing higher-order algorithms allows faster con- 




Block time step storage scheme 


3 


vergence, but this requires additional computational effort to 
calculate the high order terms (Hut & Makino 2003). The 
fourth-order Hermite integrator was demonstrated to be suc¬ 
cessful in achieving an acceptable balance between accuracy 
and speed (Aarseth 1999). Nevertheless, in the dense central 
region of galaxies or star clusters, where close encounters may 
occur frequently, very small integration time steps still have to 
be taken in order to ensure accuracy, which will dramatically 
slow down the simulation. Close encounters will eventually 
cause tight binary systems to form: such systems need perma¬ 
nent treatment with extremely small time steps. In this almost 
inevitable scenario, should all integration points be saved, the 
sizes of the resulting data files would be overwhelming. For 
instance, in a globular cluster with N - 10^, if the system is to 
be evolved for 1000 Henon time units^ with At = 10“^ (which 
is relatively large; see histogram in Fig. 1), more than lO'^ 
bytes (1 exabyte) of data would be generated in total. This 
would pose great challenges even for modern storage arrays 
and for data analysis. 

As direct computation of the O(N^) algorithm is expen¬ 
sive, optimization schemes such as individual time step (ITS), 
Ahmad-Cohen neighbor scheme (ACS; Ahmad & Cohen 
1973) were developed to dramatically reduce the computa¬ 
tional costs (Aarseth 2003). Almost all modern A-body inte¬ 
grators now employ the ITS scheme. The basic idea is that 
since gravity follows an inverse-square law, particles from re¬ 
gions of different density experience different magnitudes of 
force. The density profiles of globular clusters can be roughly 
approximated by a power law, such as the Plummer model 
(Plummer 1911) or the King model (King 1966). Stars in 
the outskirts of star clusters usually move relatively unper¬ 
turbed for timescales comparable to hundreds of times the 
corresponding timescales of the central particles, and hence 
long time steps can be used for their integration. Stars in the 
central regions, however, frequently experience violent inter¬ 
actions (close encounters) with their neighbors and therefore 
require much smaller integration time steps. Integration of a 
particle with index i is therefore carried out using a time step 
Afj, which is often taken to be (e.g. Aarseth 2003) 


Ati = 


A 


a;||a®| H- |a,:| 


|a,'||a®| H- 




(3) 


where a, is the acceleration of particle i (the total force acting 
on it divided by its mass), a,, a® and a® are the first, sec¬ 
ond and third derivatives of the acceleration. The parameter rj 
controls the accuracy of the integration and a commonly used 
value is T} - 0.02 (Aarseth 2003). Depending on different 
density profiles, the ITS scheme reduces the computational 
complexity from 0(N^) to and a larger gain can be 

achieved with centrally concentrated systems (Makino & Hut 
1988). Here, a particle is considered as active if its state is 
changed significantly in time scales comparable to the inte¬ 
grator time step. Fig. 1 shows a time step distribution of time 
steps for systems with N - 8k, 32k and 128k (in this paper 
k = 2'0 = 1024). 

According to equation (3), the time step Af, of particle i can 
get an arbitrary value. In practice, however, in order to divide 
particles into groups according to their time steps, the block 
time step (BTS) scheme is often employed, permitting parti- 


^ The A^-body unit system is referred to here as the Henon unit system in 
honor of Michel Henon. 



Figure 1. Time step distribution for Plummer models realizations with 
N = 8k, 32k and 128k at r = 1 Henon time unit, simulated with the di¬ 
rect A/^-body code NB0DY6++. The code employs the Block Time Step (BTS) 
integration scheme and a fourth order Hermite integrator. For some arbitrar¬ 
ily defined maximum time step Atniax. all smaller time steps are given by 
Istn = A?niax/2”~b where n is called the “depth of integration”. In this figure, 
the time steps are in Henon units. The peaks of the three distributions are 
shifted to the left as N increases, illustrating that systems with higher number 
density have more close pairs, which lead to smaller time steps on average. 


cles in the same time step group to be advanced at the same 
time (Hayli 1967, 1974; McMillan 1986). Fig. 2 illustrates 
how particles are advanced in the BTS scheme. For instance, 
in the hierarchical scheme used by NB0DY6 and its parallel 
version NB0DY6++^ (Spurzem 1999; Spurzem et al. 2008) as 
well as many other Aarseth-type codes, the time steps are de¬ 
fined as 

Af„ =AWx/2"-', (4) 


where n is the level of integration and Af^ax is a predefined 
maximum time step (which in practice is often taken as one 
Henon time unit). 

The integration itself is often done using a predictor- 
corrector scheme. As an example, the Hermite Scheme em¬ 
ployed by NB0DY6-I-I- first predicts the positions x,_p(f) and the 
velocities v,;p(f) at some time f: 


X/,p(f) = x,-,o -I- (f - fo)v,',o + 


it - fo) 


■a/.o + 


(t-tor. 


-a/.o 


(f — fo)^ 

v/,p(f) = Vi.o + (f - fo)a,',o + —^—a ,'0 


(5) 

(6) 


where fo is the starting time and the subscript 0 of the vector 
quantities denotes the known value of the quantity at fo. The 
acceleration a,- and its first derivative a,- are evaluated at time 
f at the predicted position (i.e. by direct summation), and the 
two higher order derivatives of the acceleration can be evalu¬ 
ated at time fo: 


(2) ®i,0 ~ 3/,0 + 3; 

~ (f - to)^ t - to 


a 


(3) 

i,0 


1T 3,;o ~ 3; , 3;,o + 3; 

12-T +D-r 

(f - fo)2 (f - fo)2 


(7) 

(8) 


^ This paper makes no distinction between NB0DY6 and NB0DY6++. 
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Particles 



Figure 2. Schematic illustration of a 4-particle system integrated with a 
block time step (BTS) scheme. Pailicles are assigned individual 

time steps according to the forces exerted on them. It is assumed here that 
the total integration time is 1 (in ai'bitrary unit) and the minimum integration 
time is yi6. Alt = i/i6, no particle is scheduled to be integrated, as none have 
time step smaller than At = 2/i6. As the system proceeds to t = 2/i6, particle 
/ is the only particle with a time step short enough to schedule an integration. 
At r = 4 /i 6, particles are scheduled for integration while particle i is 

still outside the list. The full system is integrated at ? = 8/i6. Since after that 
particle j becomes increasingly active, it is integrated every time step starting 
from t = 12 / 16 . The BTS scheme assigns time step in a hierarchical fashion 
based on equation (4), and therefore guarantees that the commensurability of 
the individual time step of all particles. 


The second and third derivatives can be used to correct the 
predicted values to fourth order: 

(9) 

Av; = ^a®(f - tof + ^a®(f - fo)" (10) 

Finally, the corrected position x,(f) and velocity v,(f) at the 
time t can be expressed as 

Xi(t) ^ Xijt) + Axi ( 11 ) 

V;(0 = V;,p(f) + Av/ (12) 

Therefore, extra terms a,,o and ajo need to be stored, and in¬ 
terpolation also introduces extra computational overhead. For 
certain applications, such as visualization, it may not be crit¬ 
ical to compute the corrector terms, so part of the computa¬ 
tional and storage overhead can be further reduced. 

3. BTS STORAGE SCHEME 
3.1. Description 

Originally inspired by the ITS scheme, Farr et al. (2012) 
have shown that the data can actually be significantly com¬ 
pressed by recording only active particles. Below, we esti¬ 
mate the data rate of this approach by first considering the 
“traditional” snapshot scheme. During one Henon time unit, 
the number of data records produced by the scheme is 

Size(Snapshots) = 2^'N, (13) 

where Rt is the temporal resolution factor, such that in one 
Henon time unit, the output operation is triggered for 2*' 


times. In the BTS scheme. 


Rt-l 

oo 

Size(BTS) = V 2"A„ -H 2*' y N„ 

n=0 

n=R, 

Rt-l 

Ri-l 

= y 2"A„ + 2*' 

N- yNn 

n=0 

«-o 


where N is the total number of particles and Nn is the number 
of particles with time step At - 1/2". For a given R,, particles 
with integration time step Af, > 1 /2*' are fully resolved, in 
the sense that output is commensurate with integration. That 
is, whenever these particles are integrated (or in the terminol¬ 
ogy of the Hermite scheme, corrected) their data are written 
to the file; moreover, this happens only when integration is 
performed. The rest of the particles, which have Af, < 1/2*', 
are not fully resolved (for particles with time step 1 /2", output 
occurs only every 2"“*' integrations). 

In the snapshot scheme, since data from all particles are 
written at the same time, particles with integration frequen¬ 
cies lower than the output frequency have to be extrapolated 
(or in the terminology of the Hermite scheme, predicted); this 
is redundant, since analysis software can do this prediction, 
which is computationally very cheap. The BTS scheme com¬ 
presses the data of the fully resolved particles by eliminating 
redundant information, which is lossless. The BTS scheme 
compresses the data of particles with integration frequency 
higher than the output frequency by skipping a certain num¬ 
ber of integration points, which is lossy. 

Fig. 3 shows that the BTS file size initially grows exponen¬ 
tially as a function of Rt (like the snapshot scheme) but turns 
over at output frequency close to the peak of the time step dis¬ 
tribution (in Fig. 1) and saturates (so that the file size remains 
finite even when the output frequency grows to infinity, on the 
left of the figure). This saturation is due to the small number 
of particles with very small time steps, as seen in Fig. 1. Note 
that for very low output frequencies, the snapshot and BTS 
files have similar sizes, converging at f?, = 0 (which in this 
case represents the maximally allowed time step). 

The snapshot scheme is prohibitively expensive if one in¬ 
tends to resolve the most rapidly varying particles, but this is 
feasible due to the convergence property of the BTS scheme. 
On the other hand, for low output frequencies, the methods 
are equivalent in terms of number of records. The snapshot 
scheme might even be preferable in this case since the extra 
particle attributes ao and ao need not be stored. 

3.2. Example 

Consider as an example the 4-particle system illustrated in 
Fig. 2, where the system is integrated for one Henon time unit 
with minimum integration time step At - i/is; ^ temporal 
resolution factor of Rt -3 is adopted, which means that 2*' = 
8 output operations are scheduled within this one Henon time 
unit. The first output (not including f = 0) is triggered at 
t - i/s, when particle I is the only particle in the output list; 
at f = 2/8 however, particles (/, k, 1) are active and eligible for 
output. At f = 1 i/ifi. although particle j is integrated, no output 
occurs as it is not a product of the output time step 2 *' and 
an integer, therefore the information is lost here. It is instead 
included in the output list at f = ^yi6 - 6/8 alone with particle 
I, and only the latest data at f = will be written. At f = 
and t = 1/8 the system receives a full output. 

At any time the number of particles in the output list will not 
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Figure 3. According to the time step distribution of the N = 128k simulation 
in Fig. 1 at t = 1, these histograms show the number of output records as a 
function of output time step (in Henon units), using the BTS scheme (solid 
blue histogram) and the snapshot scheme (dashed red histogram). 


exceed the total number of particles. Consider again the above 
example while Rt = 2, more particles will be collected at the 
output points but the output interval is longer. Recall that for 
a full snapshot output, according to equation (13) the size of 
output is proportional to N, the output size of this scheme 
is a linear function of N. Statistics of the active particle frac¬ 
tions for simulations carried out with NB0DY6++ are presented 
in Fig. 4. The linear property of the BTS scheme makes the 
scheme suitable for very large systems, as long as the detailed 
evolution of the highly active particles is not important. For 
example, the resulting datasets can be used to generate visu¬ 
alization data for the overall evolution of star clusters. The 
datasets will have sufficient resolution to describe slow parti¬ 
cles in the outskirts of the cluster in detail, allowing the view¬ 
ers to observe the evaporation process. Since the output fre¬ 
quencies of highly active particles have been truncated to 2*', 
the datasets will only have enough resolution for these parti¬ 
cles if the output resolution R, is set to sufficiently high (so 
that the output time step is comparable with the actual inte¬ 
gration of those particles). Even in such case, the output size 
will still converge to a manageable scale as seen from Fig. 3. 

3.3. Interpolation 

Analysis or visualization software needs to interpolate the 
positions and sometimes velocities and higher derivatives of 
the particles between the output points. This is best done us¬ 
ing septic splines, which are seventh degree piecewise poly¬ 
nomials. This method ensures that the interpolated curves ex¬ 
actly touch at the endpoints (or the spline’s knots) and that 
the stored information about the previous and next known 
states is used. Lower order interpolation can be used based 
on the predictor-corrector scheme (Section 2) or by applying 
a lower order splines that would discard some stored informa¬ 
tion (i.e. the known values of a); higher order interpolation 
can be achieved if one uses more than the two nearest points. 
Let us define the running variable: 


Figure 4. Fraction of active particles averaged over one Henon time unit. 
Plummer systems with N = 16k, 64k, 256k and 1024k are evolved, and with 
each N five different temporal resolutions Rj coiTesponding to five different 
output frequencies ai‘e marked with different colors of lines. As the output 
frequency goes higher, the fraction of active particles declines, and therefore 
the BTS storage scheme gains significant reduction of data rates. For the 
same output frequency, systems with larger A have higher fractions of ac¬ 
tive particles. The two methods converge at Rt = 0, which yields standard 
snapshots. 

such that 0 < T < 1, where t is an arbitrary time in which we 
are interested in the particle’s properties; fo is the last integra¬ 
tion point before time f, and fi is the next one; At = t - to is 
the output time step. The interpolated position of the particle 
is thus: 

x(t) = P0-Hpir-HP2T^H-P3T^+P4T'‘H-P5T^+P6T® + P7T’ (16) 

where po ... p? are the spline coefficients given in equations 
(17) to (24). The expressions for the velocity and higher 
derivatives can be easily determined from the above expres¬ 
sion by derivation. Let the subscripts 0 and 1 represent the 
values of the quantities at times fg and fi respectively, so we 


can write: 

Po=Xo (17) 

Pi = VflAf (18) 

P2 = jaoAf^ (19) 

P3 = gaoAf^ (20) 

P4 = -i(4ao -Hai)Af^ - §(2ao-ai)Af2 

-5(4vo + 3vi)Af-35(xo-xi) (21) 

Ps = 5 (2ao + ai)Ap -i- (lOag - 7ai) Af^ 

h-3(15vo-h 13vi)Af h-84(xo-xi) (22) 

P6 = (4ao + 3ai) Af3 - i (ISa,, - 13ai) Af^ 

-2(18vo-h 17vi)Af-70(xo-xi) (23) 

Pv = g (ao -H ai) Af^ 4- 2 (ag - aO Af 

4- 10(vg -H Vi) Af-H 20(Xg - Xi) (24) 


Note that the discussion above is per particle, and that each 
equation represents three vector components, which for the 
purpose of the interpolation are completely independent. 

4. MODIFIED BTS STORAGE SCHEMES 




















































6 


Cai et al. 


As shown in Fig. 3, the BTS file size converges with in¬ 
creasing output frequency, and therefore this scheme becomes 
mandatory for the storage of simulation data when very high 
temporal resolution is required. Nevertheless, BTS scheme 
integration data of a very large simulation can still be too 
large even for moderately large systems. While equation (14) 
shows that that the data are scalable by specifying an output 
frequency /?,, this scaling technique may not provide sufficient 
resolution for highly active particles. An alternative scaling 
technique is presented in Section 4.1, allowing the user to de¬ 
fine an individual resolution for each particle. While dynam¬ 
ically active particles are assumed to be interesting particles. 
Section 4.2 explores some possible scenarios where this may 
not necessarily be the case. Finally, Section 4.3 discusses an 
even more generic scenario in which the output may be driven 
by physical processes other than the dynamical evolution. 

4.1. Scaling with Spatial Resolution: Triggered Output for 
Significantly Updated Particles 

In this scheme, the output is triggered per particle: when¬ 
ever an individual particle has been integrated Rs times, its 
data is eligible for output. Since Rs defines the portion of in¬ 
tegration for output, it defines the resolution in a spatial man¬ 
ner: larger values of Rs correspond to lower spatial resolution, 
and Rs - 1 corresponds to a full output of all BTS integration 
data. For Rs > I, the scheme skips Rs-l integrations right af¬ 
ter the current output until the next one, reducing the data rate 
by a factor comparable to Rs (the time steps of all particles 
are changing as they move in phase space and therefore it is 
unlikely that the reduction factor is exactly Rs). The total out¬ 
put rate is proportional to the total number of individual time 
steps, and the resulting output size grows as (Makino & 
Hut 1988). 

For instance, consider again the 4-particle system illus¬ 
trated in Fig. 2. Assuming that Rs = 2, the scheme skips one 
output right after the current output, so particle i will be only 
eligible for output at f = 1; particle j is eligible for output at 
t - and so on. With a careful choice of Rs, 

the scheme yields datasets with sufficient resolution for highly 
active particles such as hard binaries and close encounters, but 
also reduces the data rate of slow particles. Full output corre¬ 
sponds to ~ 1000 records per particle orbit (on average); for 
rendering purposes it is still sufficient to reduce this by one 
order of magnitude, and hence reducing the storage consump¬ 
tion by one order of magnitude. A comparison of the file sizes 
for different values of Rs is presented in Table 1. It is sensi¬ 
ble to apply this scheme for detailed follow-ups of energetic 
subsystems. 

4.2. Dedicated Output for the Particles of Interest 

Binary and triple black holes in galactic nuclei or star clus¬ 
ter centers, hypervelocity stars and the host stars of planetary 
systems are particularly interesting objects to investigate in 
simulations. The output module of the integrator should there¬ 
fore accommodate this need by providing high-resolution out¬ 
put for the particles of interest (POIs) while suppressing the 
output of uninteresting particles to achieve maximum storage 
efficiency. POIs can be dynamically active. For example, a 
binary black hole system in a galactic nucleus can be so dy¬ 
namically active that it will take a significant fraction of the 
wall-clock time to resolve even with advanced regularization 
technique (e.g. KS regularization). However, the bouns of 
these computations is that they keep the data of those active 


Table 1 

Spatial output resolution Rs as a function of number of records. 


Rs 

# of records (w/ BTS) 

# of records (w/o BTS) 

Efficiency ratio 

50 

112128 

536870912 

4788.0 

40 

223934 

536870912 

2343.0 

30 

272236 

536870912 

1972.1 

20 

404532 

536870912 

1327.1 

10 

766584 

536870912 

700.3 

1 

6953525 

536870912 

77.2 


Notes. The simulation was carried out with NB0DY6++ for one Henon time 
unit (roughly 1 Myr) and N = 16384 paificles, where BTS is employed. 
Rs = 1 corresponds to the full output of BTS data. The smallest time step of 
anN = 16384 Plummer system is of the order of At ~ 2“*^, according to the 
time step distribution given by Fig. 1. Hence, should there be no BTS scheme, 
the total number of record is proportional to At^At ~ 5.3 X 10® according to 
equation (2) (shown in column 2). A full output of BTS data already yielded 
an efficiency ratio (column 2 divided by column 1) of 77.2, and together with 
the Rs parameter the reduction can be promising. 

particles up-to-date all the time, allowing the output module 
to simply dump the data without interpolation. On the other 
hand, it may be interesting to follow the evolution of hyper¬ 
velocity binary stars in the outskirts of the cluster (e.g. Lu et 
al. 2007). The forces exerted on those objects change rather 
slowly, despite their high velocities. Consequently, the in¬ 
tegrator will not integrate those objects frequently; reliable 
dynamical data can then be achieved with interpolation, for 
example, with the spline method presented in Section 3.3. 

4.3. Events/Attributes Driven Output 

The output scenarios previously discussed are all driven by 
the dynamical evolution of the simulated system. Output will 
be triggered when the coordinates of particles change signifi¬ 
cantly. Sometimes, however, it is necessary to have the output 
triggered by certain events and/or attributes. For example, in 
starburst galaxies, the star formation process is usually the 
most interesting process to investigate. Critical events of stel¬ 
lar evolution may not necessarily correspond to critical events 
of the dynamical evolution. Therefore, output strategy should 
instead be driven by the stellar evolution process, allowing 
the follow-up data analysis to trace the evolution of such as- 
trophysical processes. 

Gravitational dynamics codes generally provide dynami¬ 
cal information of the particles such as positions, velocities 
and accelerations. Some codes support simulation of mul¬ 
tiple astrophysical processes simultaneously. For example, 
NB0DY6-I-I- is able to take the feedback of stellar evolution into 
account while handling the dynamical processes of particles. 
For direct A-body code with stellar evolution, possible assign¬ 
ment of astrophysical quantities for individual particles could 
be tabulated as in Table 2; binary systems are very common 
in such simulations, which requires auxiliary data structure to 
describe their properties as a whole. A possible binary data 
structure is presented in Table 3. 

5. TECHNICAL CONCERNS AND BENCHMARKS 

Even on modern supercomputers, a physically realistic sim¬ 
ulation may take months to run. It is therefore critical to store 
the output such that it is accessible for further analysis. This 
requires that data be stored in a well behaved, high perfor¬ 
mance data structure. Generally, a simulation data file should 
meet the following requirements: 

• Accuracy: correctly recording the relevant quantities; 






Block time step storage scheme 


7 


Table 2 

Astrophysical quantities of individual particles in a direct A^-body 
simulation with stellar evolution 


Quantity 

Meaning 

Category 

/ 

Unique identifier of the particle 

Miscellaneous 

name 

User friendly label (e.g. for visualization) 

Miscellaneous 

t 

Current time 

Miscellaneous 

6t 

Next time step 

Miscellaneous 

m 

Mass 

St. dyn. & evo. 

X 

Position vector 

Stellar dynamics 

X 

Velocity vector 

Stellar dynamics 

a 

Acceleration vector 

Stellar dynamics 

a 

Jerk vector (first derivative of a) 

Stellar dynamics 

p 

Neighbor density 

Stellar dynamics 

‘p 

Local potential 

Stellar dynamics 

^EV 

Stellar evolution age 

Stellai' evolution 

^STAR 

Type indicator of star 

Stellar evolution 

L 

Luminosity 

Stellar evolution 

R 

Radius 

Stellar evolution 

Teff 

Effective temperature 

Stellar evolution 

z 

Metallicity 

Stellar evolution 

6m 

Mass change during /ev 

Stellar evolution 

^CORE 

Core mass 

Stellar evolution 

^CORE 

Core radius 

Stellar evolution 


Table 3 

Astrophysical quantities of binary systems in a direct A-body simulation. 


Quantity 

Meaning 

^ 1 , h 

Unique identifiers of the two particles 

p 

Orbital period 

A 

Semi-major axis 

e 

Eccentricity of the binary orbit 

I 

Orbital inclination 

luh 

Inclinations of the spins 

Xc 

Position vector of the center of mass 

Xc 

Velocity vector of the center of mass 


Notes. Individual properties of each component can be retrieved by referring 
to Table 2 with A and ii- 

• Time efficiency: data are written faster than they are 
generated, and the simulation is not slowed down sig¬ 
nificantly due to data output; 

• Space efficiency: redundancy minimized; 

• Interchangeability: machine/OS independent; 

• Scalability: scalable to simulations big and small, sim¬ 
ple and complicated; and 

• Robustness: data loss minimized when the file is cor¬ 
rupted. 

Datasets of A^-body simulations are designed to describe 
a time-evolving system. Since inactive particles are not 
recorded, the interpolation of their data requires knowledge 
of their previously active state, making the dataset itself time- 
dependent. Hence, guarantee of data consistency would be 
another requirement. 

5.1. Choosing a File Format 

File formats are roughly divided into two categories: ascii 
files and binary files, ascii files are generally easier to inter¬ 
pret and are human-readable. For example, tabular data are 


often stored as CSV (comma-separated values) files. Since the 
CSV data format simply uses one delimiter character (e.g. a 
comma) to separate fields, and uses a line break to indicate 
the termination of a record, it is widely supported and can be 
easily imported to an analysis program. More complicated 
ASCII or text formats, such as XML^ and YAML^ also have been 
developed and standardized, making text files capable of de¬ 
scribing hierarchical data structures. Text files avoid some of 
the problems encountered with binary files, such as endian¬ 
ness, padding bytes, and differences in the number of bytes 
in a machine word. However, they are not native to computer 
systems. Indeed, representation of numerical values in a text 
file is just literal, as these values are merely ascii sequences, 
and have to be converted to their intrinsic values before any 
computation can be performed. Standardized text formats, 
such as XML, structure the data with tags, which contributes 
to its low entropy. 

In contrast, high I/O throughput are usually achieved with 
binary files, since they are byte sequences native to the ma¬ 
chines. High level binary file libraries have been developed to 
resolve the problems of endianness, padding bytes, file head¬ 
ers, metadata storage, block data storage, etc. The HDF5^ 
(Hierarchical Data Format, version 5) for example, is an im¬ 
plementation of a binary file standard dedicated to handling 
large volumes of numerical data. It offers rich features such 
as compression filters, checksum filters, chunking, partial I/O, 
parallel I/O and caching. It allows the data to be structured in 
a hierarchical fashion and being accessed using POSIX-like 
path syntax. While the HDF5 format is designed for general 
purpose numerical data storage, some higher level applica¬ 
tion programming interfaces have been developed to fit into 
special applications. For example, HSPart (Adelmann et al. 
2008) is a portable high performance parallel data interface 
for HDF5, which is dedicated for the storage of particle-based 
simulation data. Other widely used binary file formats in as¬ 
trophysics include CDF*** (Common Data Format), NetCDF" 
(Network Common Data Format) and FITS'’ (Flexible Im¬ 
age Transport System). All these formats are self-describing 
and machine-independent, optimized for scientific data. The 
FITS data is mainly designed for image data as its name in¬ 
dicates, and the image metadata is stored in human readable 
ASCII head, allowing an interested user to easily examine the 
header information with a simple text editor. CDF and NetCDF 
are more general data formats. Originally they share the same 
conceptual model based on a multidimensional (array) model, 
but the latter has since diverged and is not compatible with 
the former. Some data formats are developed and optimized 
for more specific applications. For instance, the SDF format 
(Warren 2013) is used in the oct-tree based “Dark Sky” cos¬ 
mological Simulations (Skillman et al. 2014). 

5.2. Benchmarks 

We adopt HDF5 as the native output format for the direct 
A-body code NB0DY6 and NB0DY6-H-, due to the rich features 
it offers and especially its interchangeability within the as¬ 
tronomical community. For example, the GADGETZ'^ code 

’ http: //WWW .w3.org/XML/ 

* http: //WWW .yaml.org/ 

* http: //WWW .hdfgroup.org/ 

http://cdf.gsfc.nasa.gov/ 

*' http: //WWW .unidata.ucar.edu/software/netcd£/ 
http://fits.gsfc.nasa.gov/ 
http: //WWW. mpa-garching.mpg.de/gadget/ 
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(which was used, among others, in the Millennium Simulation 
mentioned above) has an options to output its snapshot data 
in HDF 5 format; the Low Frequency Array (LOFAR) chose to 
use HDF 5 to manage astronomical radio data (Anderson et al. 
2011). The internal file layout is structured with the HSPart 
scheme (Adelmann et al. 2008). For the purpose of bench¬ 
marks, we also store the data as plain text CSV files, in which 
the particle data is described by multiple columns separated 
by commas, and each line describes the full data for a parti¬ 
cle. Each floating point number takes 8 bytes. Since the CSV 
format is not hierarchical, the time variable for particles in the 
same time group is repeated many times, as shown below: 


tl, 

1, 

xl , 

yi. 

zl , 

vxl , 

vyl , 

vzl 

tl, 

2 , 

x2 , 

y2, 

z2 , 

vx2 , 

vy2 , 

vz2 

tl, 

n, 

xi , 

yi. 

zi , 

vxi , 

vyi , 

vzi 

t2 , 

1, 

xl , 

yi. 

zl , 

vxl , 

vyl, 

vzl 

t2 , 

2 , 

x2 , 

y2, 

z2 , 

vx2 , 

vy2 , 

vz2 

t2 , 

n, 

xi , 

yi. 

zi , 

vxi , 

vyi , 

vzi 


The output subroutines can be easily integrated into recent 
versions of NB0DY6 and NB0DY6-i-i-'^, in which option #46 and 
#47 of the input file are reserved for controlling the output file 
type (HDF 5 or CSV) and output frequency, respectively. De¬ 
tailed instructions for the installation and usage can be found 
in Appendix A. 

The output file sizes of the binary HDF5 output and text CSV 
output are compared in Fig. 5, and the corresponding wall- 
clock time overheads are shown in Fig. 6. It is obvious that 
even for very small systems, the performance difference be¬ 
tween HDF5 files and CSV files is well pronounced: the file 
sizes of CSV are generally larger than the corresponding file 
sizes of HDF5, as more data are repeated as meta-data in the 
CSV format. As N increases, they also grow faster than HDF 5. 
The overhead of HDF 5 is negligible even for high frequency 
output, but the overhead of CSV is significant. Fig. 7 shows 
the growth of the file size (a cluster simulation of one Henon 
time unit) as a function of particle number; it also compares 
the file size dependency on different output frequencies. It 
shows that at lower output frequencies, the data size grows 
linearly as a function of N, while at high output frequencies 
(corresponding to 2^° = 1024 outputs per henon time unit), 
the BTS scheme saves a significant fraction of the data rate. 

With the scheme described in Section 3.2, Fig. 7 shows that 
the size of output data scales linearly with the total number of 
particles, thus achieving very high space efficiency and suit¬ 
able for long-term simulation of very large systems. This 
scheme may not be able to provide sufficient resolution for 
highly active particles, as it treats all particles equally. In fact, 
highly active particles can be well resolved by the scheme 
described in Section 4.1, where the resolution of less active 
particles is sacrificed. 

6. APPLICATIONS 

As noted in Section 4, a reasonable tradeoff between out¬ 
put size and loss of information can be achieved by taking 
the most interesting astrophysical processes into account and 
then adapting R, or R, for data scalings, which makes various 
applications possible. The resulting datasets can be used for 

http: //WWW . ast.cam.ac.uk/"sverre/web/pages/nbody.htm 



Figure 5. The size of output data file for one Henon time unit as a function 
of total particle number and temporal resolution (64, 256, 512 and full output 
of the BTS data). The solid lines correspond to the output file size of when 
the output data are written in HDF 5; the dashed lines correspond to the output 
file size when the output data are written in CSV format. With the increments 
of particle number and output frequency, the CSV file size grows quickly due 
to its large redundancy of meta data. 



Figure 6. Total wall-clock time as a function of total particle number and 
temporal resolution (64, 256, 512 and full output of the BTS data) for evolv¬ 
ing very small systems (N = Ik, 2k, 4k, 8k) for one Henon time unit. The 
solid lines correspond to the output file size of when the output data is writ¬ 
ten in HDF 5; the dashed lines correspond to the output file size when the 
output data are written in CSV format. This measurement is performed on 4 
Intel(R) Core i7-3630QM CPU cores with the HDF5 library 1.8.11 (without 
GPU). Even for such small systems, the performance diflFerences of HDF5 and 
CSV are still well pronounced. The overhead caused by the HDF 5 output rou¬ 
tine is negligible, while the corresponding overhead caused by the CSV output 
routine grows quickly. 

post-simulation processing such as data visualization or data 
mining; they can also be used to store intermediate simulation 
data in large scale simulations. 

6.1. Simulation of Planetary Systems in Star Clusters 

The BTS scheme opens new approach for A-body simu¬ 
lations involving hierarchical architectures. The stability of 
planetary systems in star clusters, for example, is of funda¬ 
mental importance in understand the early stage of planet for¬ 
mation. In fact, star clusters are the building blocks of galax¬ 
ies (Lada & Lada 2003). Star formation are believed to be in 
clusters as giant molecular clouds collapse. The collapse will 
likely result to circumstellar discs, which are the progenitors 
of planetary systems. Should planetary systems be formed 
originally in the star cluster, at least some of them would have 
been stable enough to survive in the star cluster environment, 
where the densities are normally much higher than the solar 
neighbor and close encounters are not rare. The Kepler mis- 
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Figure 7. The size of output data file by defining temporal resolutions of 8, 
16, 32, 64 and 1024 outputs per Henon time unit. The output data are written 
in the HDF5 binaiy format. This gives a quick estimation of the simulation 
data size: for example, following up the evolution of a typical N ~ 100k 
system for 1000 Henon time units with R[ = 3 (i.e. 8 outputs per Henon time 
unit) coiTesponds to 100 GB of data size, which can be easily managed, even 
on personal computers. 


sion has been greatly successful in hunting exoplanets, yet it is 
worth mentioning that there are only a few exoplanets discov¬ 
ered in star clusters (e.g. Kepler- 66 , Kepler-67, see Meibom 
et al. (2013)). This dichotomy is likely due to the post for¬ 
mation disruptions of planetary systems in star clusters. This 
problem has been tackled in the previous studies with both 
direct A^-body simulations (e.g. Spurzem et al. (2009)) and 
Monte-Carlo Simulations (e.g. (Hao et al. 2013)). Neverthe¬ 
less, due to the huge range of both dynamical time scales and 
spatial scales, it is currently only feasible to investigate single 
planetary systems by treating the star-planet pairs as binaries 
and employing a regularization technique. Monte-Carlo simu¬ 
lations could indeed extend the study to the multiple planetary 
system domain, yet the results heavily depend on the quality 
of close encounter sampling. 

With the BTS scheme, it is possible to decouple the dynam¬ 
ics of the whole system into the planetary part and star cluster 
part, and by which separate the integration of each part. To 
be specific, the star cluster dynamics can be integrated with a 
dedicated code such as NBODY 6 - 1 - 1 -. Having the resulting data 
from that stored with the BTS scheme, one could then read 
the stored data, use them to calculate perturbations and plug 
them into the planetary dynamics code. 

As an example, we implemented the BTS storage scheme 
with the HDFS'^ file format. The time series data is stored with 
the HSPart scheme (Adelmann et al. 2008), as detailed in Ta¬ 
ble 4. The simulations of planetary systems are carried out 
after star cluster simulation is performed and the results are 
stored in the HDF5 file. A certain fraction of stars with simi¬ 
lar mass are assigned with planetary systems of identical ini¬ 
tial configurations. Each planetary system is integrated with 
MERCURY 6 . As the simulation progresses, the current time t is 
converted into the the Henon time units, and the correspond¬ 
ing step in the HDF5 file is located thereby. Accelerations at 
the point where the each planet is located are calculated ac¬ 
cording to the loaded data, and subsequently be applied as 
velocity kicks. If t corresponds to the intermediate state be¬ 
tween two adjacent time steps, interpolation of (x, y, z) will be 
computed according to equation (17) to (24), such that the ac- 

We note especially that HDF5 is chosen just as an example because it is 
very prevalent and flexible, other formats such as SDF (Warren 2013) exists 
and can be used in the same way. 



Figure 8. Interpolation of the position of a given star. Without interpola¬ 
tion, the position of the star is a step-like function of time (shown with dots). 
With interpolation, the position is smoothed, allowing velocity kicks to be 
preciously evaluated (shown with solid curves). 

celeration at timescales comparable to the typical timescales 
of planets can be precisely evaluated (as demonstrated in Fig. 
8). 

Since redundant data is minimized in the BTS storage 
scheme, we could adopt very high output frequency of star 
cluster integration data while maintaining reasonable data file 
size (as shown in Fig. 3). Together with the septic spline 
interpolation technique and making full use of all available 
data in the two adjacent time steps, the velocity kicks can be 
calculated with very high accuracy and temporal resolution. 
Furthermore, we parallelize the interpolation on GPUs with 
Thrust/CUDA'®. In our simulations, the star cluster has 4000 
stars, where identical planetary systems are assigned to 1 % 
of Solar-type stars. Each planetary system contains the 4 gas 
giants in the present-day Solar System. The BTS scheme has 
a temporal resolution R, - 8 , corresponding to 256 outputs 
per Henon time unit, or roughly 10^ years per output. On 2 
Intel Xeon X565Q cores, evolving such a coupled systems 
for about 1 Myr takes about 12 hours. The code is not fully 
optimized for the purpose of benchmark, and the actual wall- 
clock time depends primarily on the frequency of communica¬ 
tion between NB0DY6++ and MERCURY 6 . The communication 
of NB0DY6++ and MERCURY 6 is implemented within the AMUSE 
framework (Portegies Zwart et al. 2013; Portegies Zwart et al. 
2009)). The scientific results of this application is presented 
primarily in Cai et al. (2015, in prep.). 

6 . 2 . Long-term Evolution of Massive Globular Cluster with 
Million Bodies 

Simulations of massive globular cluster in the regime of 
million bodies and/or million solar masses are made feasi¬ 
ble only in recent years, thanks to the exciting evolution of 
GPU-based high performance computing technology. In a re¬ 
cent research, Wang et al. (2015, in prep.) evolve a globular 
cluster with A = 1.05m (950k single stars and 50k binaries) 
for 12Gyr. With a temporal resolution R, = 3, 8 outputs are 
generated for each Henon time unit, corresponding to 475 MB 
of data. According to the time scaling of Henon time unit to 
physical time units, the total time of simulation corresponds 

http://docs.nvidia.com/cuda/thrust/ 
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Table 4 

Internal file layout of the star cluster time series data file 


Step# 

Attributes (scalar) 

Data (vectors) 

0 

to,No, ... 

V V m 

^O’yO’^0’^0 ’^0 ’^0 ’*0 ’^0 ’^0 ’^0 ’^0 ’^0 

1 

ti,Ni,- 

(1) (1) (1) (2) (2) (2) (3) (3) (3) 

xi,yi,zi,x'',y'',z' 

2 

t2, N 2 , ■■■ 

X 2 , y 2 . Z 2 , x<« z<« xf, yf\ y®. zf, m 2 .... 

n 


x„, y„. z„,x<« yW, z®, X® .'y®,z®. X® , y®, z®. m„,... 


Notes. The time series is organized as HDF5 groups, and in which vector data and scalar attributes corresponding to a given time step is grouped. 


to about 10"^ to 10^ Henon time units, depending on the total 
mass of the cluster. As such, the total data output of the BTS 
scheme is roughly 5 TB to 500 TB. 

6.3. Applications for Grid-based Simulations 

BTS-like storage schemes can also be very useful for grid- 
based simulations. Modern adaptive-mesh-refinement (AMR) 
codes, such as Enzo (Bryan et al. 2014) and GAMER (Schive et 
al. 2010), adopt the individual time step integration powered 
by GPU acceleration. The total number of refinement levels 
is typically around ten, making the evolution time steps of the 
root level and the highest refinement level differ by a factor of 
~ 1,000. It is hence impractical to store the entire snapshot at 
each sub-step. 

For example, in the cosmological simulations of wavelike 
dark matter (Schive et al. 2014), the dynamical timescale 
of the solitonic core in each dwarf galaxy is only about 50 
Myr. It hence requires ~ 1.6 x 10"^ data dumps from redshift 
one to the present day (assuming 100 dumps per dynamical 
timescale). Each full snapshot takes about 80 GB in a 1.5 
Mpc/h comoving box, with ~ 10^° cells in total. The total 
amount of data in the snapshot scheme thus consume ~ 1.3 
petabyte. For comparison, if we are mainly interested in the 
dynamical evolution of one solitonic core, we can utilize the 
BTS-like storage scheme to only output the core data more 
frequently. For a solitonic core with a radius of 1 kpc and a 
simulation resolution of 60 pc, it consumes about 160 kilo¬ 
byte for one data dump and 2.5 GB in total after redshift one. 
Accordingly, the storage requirement can be significantly re¬ 
duced by a factor of ~ 5 x 10^. 

6.4. Data Visualization 

Astronomical data take on a multitude of forms: catalogs, 
data cubes, images, and simulations (Kent 2013). Because of 
their complexity, they are usually explored using data visu¬ 
alization, which is in fact reorganization of the original data 
by graphical means. It is particularly useful to illustrate the 
dynamical evolution of A-body systems. Visualization can 
be done in various ways, from a simple 2D plot to a real¬ 
istic visual reconstruction of complicated multi-scale astro- 
physical processes. This simple idea can become challeng¬ 
ing in the context of astrophysical data because of the wide 
dynamical range and large particle number. If the data are 
stored in a “compact” fashion such that only active particles 
are recorded, as described in Section 4, then the position need 
to be interpolated using (for example) septic splines as pre¬ 
sented in Section 3.3 prior to rendering. Since each particle is 
interpolated independently, this problem is “embarrassingly 
parallel” and very suitable for GPUs (e.g. programmed in 
CUBA or OpenCL). It is common that the total number of 
particles exceeds the total number of pixels on the viewport, 
and therefore the visualization program should be adjusted to 


ft O O NBODY6++ Snapshot Viewer 



Figure 9. Visualization of NB0DY6++ snapshot with the vispy library. The 
particles are colored according to their temperature, and are sized according 
to their luminosity. 


the user’s interests. Furthermore, because of the large dynam¬ 
ical ranges, data usually have to be scaled before rendering. 
For example, the stellar mass m can range from ~ 0.1 Mq to a 
much as ~ 150 Mq; the power P emitted by a star is a strong 
function of its temperature T and radius R, as implied by the 
Stefan-Boltzmann law P cx If m or P are rendered di¬ 

rectly on the screen, then massive or bright stars are easily 
saturated, while light or faint stars are difficult to distinguish. 

In practice, it is usually not enough to recreate the evolution 
process of an A-body system by plotting only the coordinates. 
The velocity vector, mass, size, temperature, luminosity are 
then expected to be rendered as associated properties of the 
coordinates, such as color, symbol or size. As an example 
we adopt the astronomical plotting library vispy for the vi¬ 
sualization of an NB0DY6-I-I- simulation as Fig. 9 shows; as 
another example we also adopt the open source scientific vi¬ 
sualization package ParaView to visualize the mass spectrum 
of dense globular cluster simulations, as shown in Fig. 10. 

7. CONCLUSION 

We present the Block Time Step (BTS) storage scheme for 
the data management of astrophysical A-body simulations, in¬ 
spired by the individual time step integration scheme (Makino 
& Hut 1988). This is an urgent response to the ever increas¬ 
ing challenges posed by modern highly computationally ex¬ 
pensive simulations. By adopting the BTS storage scheme, 
the growth of data can be dramatically scaled down from A^ 
to (for the Plummer model). Depending on the usages 
of simulation, it is not necessary for all integration data to be 
recorded. Instead, a resolution parameter can be defined ei¬ 
ther in the space domain or in the time domain, which offers 
the flexibility of data scaling. Apart from theoretical analy- 
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Figure 10. Visualization of NB0DY6++ simulation data with ParaView. The 
figure shows a cluster with N = 5000 particles (King Model, Wo = 6.0, 
Kroupa (2001) initial mass function) The upper left panel shows an overview 
of the star cluster; the botton-left panel shows the trajectories of the particle of 
interest (POI), and in this case they are two stellar mass black holes (masses 
Ml = IOM 0 and M 2 = 2 OM 0 ). The stars are colored with their stellar types. 
The corresponding H-R diagram evolves simultaneously with the cluster is 
shown on the right panel. The data are written in HDF5 format with HSPart 
scheme, which is supported by ParaView via the built-in HSPartReader. 

sis and predictions, dedicated simulations are carried out and 
the results are consistent with the theory. Our I/O perfor¬ 
mance benchmark of the binary HDF5 and the ascii CSV for¬ 
mat shows that binary formats are generally more preferable 
to store large and complicated datasets, yet for lightweight 
datasets text files exhibit their convenience for data analysis 
and portability. A list of astrophysical quantities for particles 
with potential user interests is proposed, and some of these 
quantities are visualized with open-source packages and li¬ 
braries such as ParaView, GLnemo2 and s2plot. 

The growth of numerical simulation scales and data rates 
implies that not only computations, but also data storage, vi¬ 
sualization and analysis need to be carried out distributively. 
Open source packages currently provide strong support for 
the technical implementation of these distributed systems, yet, 
to implement them into astrophysics-driven systems, adapta¬ 
tions need to be made according to the specific astrophysical 
context. This paper therefore addresses the concerns and pos¬ 
sible solutions. Typical applications of the proposed scheme 
on astrophysical scenarios such as simulations of planetary 
systems in star cluster, cosmological grid-based simulations, 
long-term evolution of million solar masses globular clusters 
and scientific visualizations are presented as well. 

Our discussion is primarily focused on particle-based di¬ 
rect A-body simulations. The philosophy behind the pro¬ 
posed scheme is to “apply proper scaling to the simulation 
data to provide fine-grain control of the resolution of scien¬ 
tifically interesting data while suppressed the uninteresting 


ones"'. Despite the different algorithms used in other kinds 
of astrophysical simulations, such as hydrodynamics simu¬ 
lations, tree codes, adaptive mesh refinement (AMR) codes, 
Monte-Carlo simulations, and many other new algorithms un¬ 
der development, the methodology addressed in this paper is 
transferable to a wide range of scenarios. 
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APPENDIX 

CUSTOM OUTPUT SUBROUTINES FOR NB0DY6 

We implemented the HDF5 and CSV custom data format output subroutines for NB0DY6 and NBODY 6 - 1 -+, both 
for benchmark purpose and the need of long term data management. The subroutines can be downloaded from 
http://silkroad.bao.ac.cn/~maxwell/hdf5. The integration is trivial: (1) compile and install the HDF5 library from 
the source code, which can be obtained from http://www.hdfgroup.org/HDF5/release/obtainsrc.html; (2) copy the 
custom output subroutines source code custom_output. f to the Ncode directory of NB0DY6; (3) modify the Makefile to 
add the custom_output. f file into the list of source files; (4) call the subroutine by adding one line into the intgrt. f (or 
intgrt.omp.f for the GPU2 version). (5) Add a common block to the “hrplot.f” so that stellar evolution data can also be 
dumped to the output. More detailed instruction can be found in the README file of of the downloaded package. In the NB0DY6 
input file, option #46 and #47 are used to control the output file type, respectively, as shown in Table 5. 
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Table 5 

Fine-grain control of output frequency and file format for NB0DY6. 


Option 

KZC46)=1 

KZC46)=3 

KZ(46)=2 

KZC46)=4 

KZC47)=R, 


Meaning 


Output BTS data as HDF5 (active particle only) 
Output BTS data as HDF5 (all particles) 

Output BTS data as HDF5 (active particle only) 
Output BTS data as CSV (all particles) 

The output frequency is 2^' times per Henon time unit 


VISUALIZATION TECHNIQUES OF THE HDF5-BASED BTS DATA 

The HDF5 data generated by the custom output subroutines described in Appendix A can be visualized directly with ParaView. 
The HSPart reader is included in the ParaView 4.x distribution, but it is not activated by default. To enable it, users may navigate 
to the main menu and click “Tools — Manage Plugins”, and then find the HSPartReader and select “Auto Load”. After that one 
will be able to load the HDF5 simulation datasets from the Open menu. After loading the file, users may select the X, Y and Z 
arrays from the drop-down list for visualization. 

We also implemented a vi spy-based visualization script for the simulation datasets, which can be downloaded from 
http://silkroad.bao.ac.cn/~maxwell/hd£5. 
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